Welcome to the LAST part (part 3) of MLRS-101 !!

Quick review of what we’ve done in Parts 1 and 2.

  1. Introduction (Hello World!)
  2. Install R-studio

– We successfully installed R-studio and familiarized ourselves with its interface.

  1. Yes, you are ready
  2. Data preparation + visualization

– We successfully learned how to visualize data as desired (including data cleaning and manipulation).

For Part 3, we’ll dive into why the Hawks lost.

  1. Analysis of the observations
  2. Good luck :)

Last session we were able to create a figure showing how the score changes over time. That figure is an observation of the result, and when you show it, RM will ask you “why did that happen?” So let’s answer his question.

4. Analysis of the observations

1) Why did that happen?

We’ve seen that the Hawks lost the game. As a curious researcher, you might wonder why they lost or which quarter was their downfall against the Knicks. Let’s analyze their shot selection (2-pointers like dunks, jump shots, and layups, or 3-pointers) and success rates throughout each quarter. This will help us identify potential weaknesses.

Select data and create a new column

In order to analyze our data we need some help from the packages that we installed.

Remember: When you turn off your Nintendo Switch and revisit Super Mario the next day, you don’t need to repurchase the game, but you do need to restart it, right? Similarly, in R, we need to “load the game” - re-import the libraries needed for analysis - before working with our saved DataFrame. Let’s bring our level 10 Mario game (the NBA_19_20_SAF_March_11 data) back to our machine.

# Load the tidyverse package (game)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the ggplot2 package (game)
library(ggplot2)
# Load the ggpmisc package (game)
library(ggpmisc)
## Warning: package 'ggpmisc' was built under R version 4.3.3
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Registered S3 method overwritten by 'ggpmisc':
##   method                  from   
##   as.character.polynomial polynom
# Load the gganimate package (game)
library(gganimate)
# Load the animation package (game)
library(animation)
# Load the kableExtra package (game)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
## bring the dataframe as "NBA_19_20_SAF_March_11" with readRDS() function
NBA_19_20_SAF_March_11 <- readRDS("~/Downloads/NBA_19_20_SAF_March_11.rds")

2) Calculate Shot Rate

Let’s determine how often each shot type (excluding free throws) was made or missed, based on the ShotType and ShotOutcome columns.

## Count how many shots and results they tried per 2pts - dunk, jump shot, layup or 3pts
shot_type_with_outcome_count <- NBA_19_20_SAF_March_11 %>% ## from NBA_19_20_SAF_March_11 data
                  group_by(ShotType, ShotOutcome) %>%  
                  summarize(count = n()) ## we are going to count how many events happened based on two columns (ShotType, ShotOutcome)
## `summarise()` has grouped output by 'ShotType'. You can override using the
## `.groups` argument.
## Count how many shots they tried per 2pts - dunk, jump shot, layup or 3pts
shot_type_count <- NBA_19_20_SAF_March_11 %>%
                  group_by(ShotType) %>%
                  summarize(total_count = n()) ## we are going to count how many events happened based on ShotType columns

compute_the_shot_rate <- left_join(shot_type_with_outcome_count, shot_type_count, by = "ShotType")

compute_the_shot_rate$rate <- compute_the_shot_rate$count / compute_the_shot_rate$total_count *100

compute_the_shot_rate
## # A tibble: 9 × 5
## # Groups:   ShotType [5]
##   ShotType         ShotOutcome count total_count  rate
##   <chr>            <chr>       <int>       <int> <dbl>
## 1 ""               ""            332         332 100  
## 2 "2-pt dunk"      "make"         20          23  87.0
## 3 "2-pt dunk"      "miss"          3          23  13.0
## 4 "2-pt jump shot" "make"         31          68  45.6
## 5 "2-pt jump shot" "miss"         37          68  54.4
## 6 "2-pt layup"     "make"         22          40  55  
## 7 "2-pt layup"     "miss"         18          40  45  
## 8 "3-pt jump shot" "make"         25          69  36.2
## 9 "3-pt jump shot" "miss"         44          69  63.8
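The same rate can also be computed in a single pipeline. The sketch below is only an illustration on made-up rows (toy_shots is hypothetical, not the game data); it assumes, as the output above suggests, that a blank ShotType marks an event that is not a shot, and it drops those rows first.

```r
library(dplyr)

# Toy data, made up for illustration (not the real game log)
toy_shots <- data.frame(
  ShotType    = c("2-pt layup", "2-pt layup", "2-pt layup",
                  "3-pt jump shot", "3-pt jump shot", ""),
  ShotOutcome = c("make", "miss", "make", "miss", "miss", "")
)

toy_rate <- toy_shots %>%
  filter(ShotType != "") %>%                        # assumption: blank = not a shot
  count(ShotType, ShotOutcome, name = "count") %>%  # rows per type/outcome pair
  group_by(ShotType) %>%
  mutate(total_count = sum(count),                  # attempts per shot type
         rate = count / total_count * 100) %>%      # share in percent
  ungroup()

toy_rate
```

This avoids the separate left_join() step, at the cost of being a little less explicit about where total_count comes from.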

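While this code provides overall shot rates, we are interested in analyzing team-specific performance. (Row 1, with an empty ShotType, holds the events that have no shot type, so it can be ignored here.)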

This part will require additional critical thinking. The appropriate methods will depend on both the available data and the specific results we aim to achieve.

3) Filter Data by Team

To start, let’s separate the data into Knicks and Hawks plays using the subset() function. The logic is that if the AwayPlay column is not empty, it’s a Knicks play.

play_by_nyk <- subset(NBA_19_20_SAF_March_11, AwayPlay != "")

play_by_atl <- subset(NBA_19_20_SAF_March_11, HomePlay != "")

Verify that the data is correctly separated:

unique(play_by_nyk$HomePlay)
## [1] ""
unique(play_by_atl$AwayPlay)
## [1] ""
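The visual check with unique() can also be written as an automatic check. This is a minimal sketch on made-up rows (toy_plays and its play descriptions are hypothetical); it assumes every row has exactly one of the two columns filled. If the real data contains rows where both columns are empty, the row-count check would fail and would need to be relaxed.

```r
# Toy play-by-play rows (made up); exactly one of the two columns is filled per row
toy_plays <- data.frame(
  AwayPlay = c("A. makes 2-pt layup", "", "B. misses 3-pt jump shot", ""),
  HomePlay = c("", "C. makes 2-pt dunk", "", "D. defensive rebound")
)

toy_away <- subset(toy_plays, AwayPlay != "")
toy_home <- subset(toy_plays, HomePlay != "")

# stopifnot() stays silent when every condition is TRUE and stops with an error otherwise
stopifnot(
  all(toy_away$HomePlay == ""),                             # no home plays among away rows
  all(toy_home$AwayPlay == ""),                             # no away plays among home rows
  nrow(toy_away) + nrow(toy_home) == nrow(toy_plays)        # every row landed somewhere
)
```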

We’ll now calculate shot rates for both teams. However, manually repeating this process for multiple teams is inefficient. In the next step, we’ll create a function to automate the calculation.

4) Creating a Reusable Function

Let’s create a function named compute_shot_rate_per_quarter to efficiently calculate shot rates for each quarter. This function takes a DataFrame (df) as input and returns the calculated shot rates.

# Create a function that computes the success rate per quarter
compute_shot_rate_per_quarter <- function(df){
  ## Count shots per quarter, shot type (2-pt dunk, jump shot, layup or 3-pt) and outcome
  shot_type_with_outcome_count <- df %>% ## from the input data frame (df)
                    group_by(Quarter, ShotType, ShotOutcome) %>%  
                    summarize(count = n()) ## number of events per Quarter, ShotType and ShotOutcome
  
  ## Count shot attempts per quarter and shot type
  shot_type_count <- df %>%
                    group_by(Quarter, ShotType) %>%
                    summarize(total_count = n()) ## number of events per Quarter and ShotType
  
  compute_the_shot_rate <- left_join(shot_type_with_outcome_count, shot_type_count, by = c("Quarter", "ShotType") )
  
  ## Rate of each outcome in percent
  compute_the_shot_rate$rate <- compute_the_shot_rate$count / compute_the_shot_rate$total_count *100
  
  # Keep only the made shots, so that rate is the success rate
  compute_the_shot_rate <- compute_the_shot_rate %>%
                                subset( compute_the_shot_rate$ShotOutcome == "make" )
  
  return(compute_the_shot_rate)
}

This function replaces the specific DataFrame NBA_19_20_SAF_March_11 with a generic df, making it adaptable to different datasets.

5) Run the function we created

play_by_nyk_with_shot_rate <- compute_shot_rate_per_quarter(play_by_nyk)
## `summarise()` has grouped output by 'Quarter', 'ShotType'. You can override
## using the `.groups` argument.
## `summarise()` has grouped output by 'Quarter'. You can override using the
## `.groups` argument.
play_by_nyk_with_shot_rate$team <- "NYK"
play_by_nyk_with_shot_rate
## # A tibble: 17 × 7
## # Groups:   Quarter, ShotType [17]
##    Quarter ShotType       ShotOutcome count total_count  rate team 
##      <int> <chr>          <chr>       <int>       <int> <dbl> <chr>
##  1       1 2-pt dunk      make            2           2 100   NYK  
##  2       1 2-pt jump shot make            3           6  50   NYK  
##  3       1 2-pt layup     make            5           8  62.5 NYK  
##  4       1 3-pt jump shot make            2           8  25   NYK  
##  5       2 2-pt dunk      make            4           5  80   NYK  
##  6       2 2-pt jump shot make            5          10  50   NYK  
##  7       2 2-pt layup     make            2           3  66.7 NYK  
##  8       2 3-pt jump shot make            3           5  60   NYK  
##  9       3 2-pt dunk      make            5           5 100   NYK  
## 10       3 2-pt jump shot make            3           8  37.5 NYK  
## 11       3 2-pt layup     make            1           2  50   NYK  
## 12       3 3-pt jump shot make            2           5  40   NYK  
## 13       4 2-pt jump shot make            6          13  46.2 NYK  
## 14       4 3-pt jump shot make            2           6  33.3 NYK  
## 15       5 2-pt dunk      make            2           2 100   NYK  
## 16       5 2-pt jump shot make            1           1 100   NYK  
## 17       5 3-pt jump shot make            2           4  50   NYK
play_by_atl_with_shot_rate <- compute_shot_rate_per_quarter(play_by_atl)
## `summarise()` has grouped output by 'Quarter', 'ShotType'. You can override
## using the `.groups` argument.
## `summarise()` has grouped output by 'Quarter'. You can override using the
## `.groups` argument.
play_by_atl_with_shot_rate$team <- "ATL"
play_by_atl_with_shot_rate
## # A tibble: 19 × 7
## # Groups:   Quarter, ShotType [19]
##    Quarter ShotType       ShotOutcome count total_count  rate team 
##      <int> <chr>          <chr>       <int>       <int> <dbl> <chr>
##  1       1 2-pt dunk      make            1           1 100   ATL  
##  2       1 2-pt jump shot make            4           9  44.4 ATL  
##  3       1 2-pt layup     make            1           4  25   ATL  
##  4       1 3-pt jump shot make            3          14  21.4 ATL  
##  5       2 2-pt dunk      make            2           2 100   ATL  
##  6       2 2-pt jump shot make            2           9  22.2 ATL  
##  7       2 2-pt layup     make            4           8  50   ATL  
##  8       2 3-pt jump shot make            3           6  50   ATL  
##  9       3 2-pt dunk      make            2           4  50   ATL  
## 10       3 2-pt jump shot make            3           6  50   ATL  
## 11       3 2-pt layup     make            4           8  50   ATL  
## 12       3 3-pt jump shot make            2           7  28.6 ATL  
## 13       4 2-pt dunk      make            2           2 100   ATL  
## 14       4 2-pt jump shot make            3           4  75   ATL  
## 15       4 2-pt layup     make            3           4  75   ATL  
## 16       4 3-pt jump shot make            5           9  55.6 ATL  
## 17       5 2-pt jump shot make            1           2  50   ATL  
## 18       5 2-pt layup     make            2           2 100   ATL  
## 19       5 3-pt jump shot make            1           5  20   ATL
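The `summarise()` messages above are informational, not errors. If they are distracting, one option is the .groups argument that the message itself mentions. The sketch below uses made-up rows (toy_shots is hypothetical) and shows that .groups = "drop" returns an ungrouped result without the message. Note that this would also remove the "# Groups:" line from the printed tables.

```r
library(dplyr)

# Toy data, made up for illustration
toy_shots <- data.frame(
  Quarter     = c(1, 1, 1, 2, 2),
  ShotType    = c("2-pt layup", "2-pt layup", "3-pt jump shot",
                  "2-pt layup", "2-pt layup"),
  ShotOutcome = c("make", "miss", "make", "make", "make")
)

toy_counts <- toy_shots %>%
  group_by(Quarter, ShotType, ShotOutcome) %>%
  summarize(count = n(), .groups = "drop")  # ungrouped result, no message

is_grouped_df(toy_counts)  # FALSE: the grouping has been dropped
```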

At the end we want to compare the team-level stats side-by-side. In order to do this we can form a union of two tables to make a single table.

6) Create the table

Let’s restructure the data into a more readable format. We’ll exclude unnecessary columns, pivot the data by quarter, and combine the results for both teams.

  • Exclude three columns (ShotOutcome, count, total_count).

  • Pivot the data so that each quarter becomes its own column holding the success rate.

  • Combine the two tables with the rbind() function.

ATL_organized <- play_by_atl_with_shot_rate %>% 
                subset(select = -c(ShotOutcome, count, total_count) ) %>%
                pivot_wider(
                  names_from = Quarter, 
                  values_from = rate
                )

NYK_organized <- play_by_nyk_with_shot_rate %>% 
                subset(select = -c(ShotOutcome, count, total_count) ) %>%
                pivot_wider(
                  names_from = Quarter, 
                  values_from = rate
                )

Agg_ATL_NYK_shot_rate_per_q <- rbind(ATL_organized, NYK_organized)

Agg_ATL_NYK_shot_rate_per_q
## # A tibble: 8 × 7
## # Groups:   ShotType [4]
##   ShotType       team    `1`   `2`   `3`   `4`   `5`
##   <chr>          <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2-pt dunk      ATL   100   100    50   100      NA
## 2 2-pt jump shot ATL    44.4  22.2  50    75      50
## 3 2-pt layup     ATL    25    50    50    75     100
## 4 3-pt jump shot ATL    21.4  50    28.6  55.6    20
## 5 2-pt dunk      NYK   100    80   100    NA     100
## 6 2-pt jump shot NYK    50    50    37.5  46.2   100
## 7 2-pt layup     NYK    62.5  66.7  50    NA      NA
## 8 3-pt jump shot NYK    25    60    40    33.3    50
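A note on the NA values: because the function keeps only the "make" rows, an NA can mean either that a shot type was never attempted in that quarter or that it was attempted and never made. The sketch below, on made-up rows (toy_shots is hypothetical), shows one alternative way to compute the rate that keeps the two cases apart. It is a swapped-in technique for illustration, not the method used in this tutorial.

```r
library(dplyr)
library(tidyr)

# Toy data, made up: layups in Q2 are attempted but never made,
# dunks in Q2 are never attempted
toy_shots <- data.frame(
  Quarter     = c(1, 1, 2, 2, 1),
  ShotType    = c("2-pt layup", "2-pt layup", "2-pt layup", "2-pt layup", "2-pt dunk"),
  ShotOutcome = c("make", "miss", "miss", "miss", "make")
)

toy_wide <- toy_shots %>%
  group_by(ShotType, Quarter) %>%
  summarize(rate = mean(ShotOutcome == "make") * 100,  # share of makes in percent
            .groups = "drop") %>%
  pivot_wider(names_from = Quarter, names_prefix = "Q", values_from = rate)

toy_wide
```

Here the layup rate in Q2 is 0 (attempted, none made), while the dunk rate in Q2 is NA (no attempts at all).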

Let’s view this data in table format.

# Make a table with borders on all seven columns
kable(Agg_ATL_NYK_shot_rate_per_q, row.names = FALSE) %>%
     column_spec(1:7, border_left = TRUE, border_right = TRUE) %>% 
     kable_styling()
ShotType        team          1          2          3          4    5
2-pt dunk       ATL   100.00000  100.00000   50.00000  100.00000   NA
2-pt jump shot  ATL    44.44444   22.22222   50.00000   75.00000   50
2-pt layup      ATL    25.00000   50.00000   50.00000   75.00000  100
3-pt jump shot  ATL    21.42857   50.00000   28.57143   55.55556   20
2-pt dunk       NYK   100.00000   80.00000  100.00000         NA  100
2-pt jump shot  NYK    50.00000   50.00000   37.50000   46.15385  100
2-pt layup      NYK    62.50000   66.66667   50.00000         NA   NA
3-pt jump shot  NYK    25.00000   60.00000   40.00000   33.33333   50
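The five decimal places in the table are more precision than we need. A minimal sketch of one way to round the display, using the digits argument of knitr's kable() on a small stand-in table (toy_rates is hypothetical; its two values are 4/9 and 3/14 expressed in percent, as in the ATL first quarter):

```r
library(knitr)

# Small stand-in table with unrounded percentages
toy_rates <- data.frame(
  ShotType = c("2-pt jump shot", "3-pt jump shot"),
  team     = c("ATL", "ATL"),
  Q1       = c(400 / 9, 300 / 14)
)

# digits = 1 rounds numeric columns to one decimal place for display only;
# the data frame itself keeps the full values
toy_table <- kable(toy_rates, format = "pipe", digits = 1)
toy_table
```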

It appears that visualizing this information through plotting might provide a clearer understanding than examining the table itself.

As you can see, the dataset “Agg_ATL_NYK_shot_rate_per_q” currently has eight rows and seven columns. However, using numbers as column names invites mistakes: R reads a bare 1 as the number 1, not as the column named “1”. Consider the following code:

ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, 1, color = team)) +
       geom_point(size = 10) +
       scale_colour_manual(name="", values = c("NYK"="orange", "ATL"="red"))

You might assume this plot displays the success rate for each team, but it actually plots the value 1 on the y-axis.
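To see the difference outside of a plot, here is a small sketch with a made-up two-row table (toy is hypothetical; the values are the NYK first-quarter rates). Wrapping the name in backticks tells R that we mean the column, and the same backtick form works inside aes().

```r
# check.names = FALSE keeps "1" as the column name instead of changing it to "X1"
toy <- data.frame(
  ShotType = c("2-pt dunk", "2-pt layup"),
  `1`      = c(100, 62.5),
  check.names = FALSE
)

with(toy, 1)     # the number 1, not the column
with(toy, `1`)   # the column named "1": 100.0 62.5
toy[["1"]]       # same column, selected by its name as a string
```

Backticks work, but they are easy to forget, which is why renaming the columns is the safer habit.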

To avoid this issue, we’ll rename the numeric column names to include “Q” for quarter. Let’s examine the current column names:

colnames(Agg_ATL_NYK_shot_rate_per_q)
## [1] "ShotType" "team"     "1"        "2"        "3"        "4"        "5"

As expected, we have seven columns: “ShotType”, “team”, “1”, “2”, “3”, “4”, and “5”. We’ll rename the five numeric ones to “Q1”, “Q2”, “Q3”, “Q4”, and “Q5” while preserving the original order.

colnames(Agg_ATL_NYK_shot_rate_per_q) <- c("ShotType", "team", "Q1", "Q2", "Q3", "Q4", "Q5" )

It’s crucial to maintain the original column order for accurate data analysis.

Alternatively, the gsub() function with regular expressions can be used to rename columns without worrying about order.
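A minimal sketch of the gsub() approach, shown on a plain character vector that mirrors the column names above (old_names and new_names are illustrative names, and the pattern is one possible choice). The pattern matches names made up entirely of digits and puts a "Q" in front of them, so the other names are left alone wherever they sit.

```r
old_names <- c("ShotType", "team", "1", "2", "3", "4", "5")

# ^ and $ anchor the match to the whole name; \\1 is the matched digits
new_names <- gsub("^([0-9]+)$", "Q\\1", old_names)

new_names
```

Applied to the table, this would look like colnames(df) <- gsub("^([0-9]+)$", "Q\\1", colnames(df)), with df standing for the data frame to rename.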

Alternatively, the table could have been built with “Q” already in the column names, using the names_prefix argument of pivot_wider():

ATL_organized <- play_by_atl_with_shot_rate %>% 
                subset(select = -c(ShotOutcome, count, total_count) ) %>%
                pivot_wider(
                  names_from = Quarter,
                  names_prefix = 'Q',
                  values_from = rate
                )

NYK_organized <- play_by_nyk_with_shot_rate %>% 
                subset(select = -c(ShotOutcome, count, total_count) ) %>%
                pivot_wider(
                  names_from = Quarter, 
                  names_prefix = 'Q',
                  values_from = rate
                )

Agg_ATL_NYK_shot_rate_per_q <- rbind(ATL_organized, NYK_organized)

Agg_ATL_NYK_shot_rate_per_q
## # A tibble: 8 × 7
## # Groups:   ShotType [4]
##   ShotType       team     Q1    Q2    Q3    Q4    Q5
##   <chr>          <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2-pt dunk      ATL   100   100    50   100      NA
## 2 2-pt jump shot ATL    44.4  22.2  50    75      50
## 3 2-pt layup     ATL    25    50    50    75     100
## 4 3-pt jump shot ATL    21.4  50    28.6  55.6    20
## 5 2-pt dunk      NYK   100    80   100    NA     100
## 6 2-pt jump shot NYK    50    50    37.5  46.2   100
## 7 2-pt layup     NYK    62.5  66.7  50    NA      NA
## 8 3-pt jump shot NYK    25    60    40    33.3    50

7) Analyzing Shooting Success by Quarter

This section will explore the shooting success of the Atlanta Hawks (ATL) and New York Knicks (NYK) throughout the game, focusing on each quarter and overtime. We’ll utilize visualizations to compare their performance across different shot types (2pt Dunk, 2pt Jump Shot, 2pt Layup, and 3pt Shot).

First Quarter

The opening quarter favored the NYK. Our plot (Figure 1) reveals they had a higher (or equal) success rate for every shot type compared to ATL.

ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q1, color = team)) +
       geom_point(size = 10) + 
       scale_colour_manual(name="", values = c("NYK"="orange", "ATL"="red")) + 
       scale_shape_manual(values=c(23, 24)) + 
       ggtitle("Figure 1: Q1") + 
       ylab("success rate for every shot")

Second Quarter

The second quarter showcased a more competitive battle. While ATL edged out NYK on 2pt Dunk attempts (Figure 2), NYK dominated in 2pt Jump Shots, Layups, and 3pt Shots.

ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q2, color = team)) +
       geom_point(size = 10) + 
       scale_colour_manual(name="", values = c("NYK"="orange", "ATL"="red")) + 
       scale_shape_manual(values=c(23, 24)) + 
       ggtitle("Figure 2: Q2") + 
       ylab("success rate for every shot")

Overall First Half

Based on the combined data from the first two quarters (Figures 1 & 2), NYK held a slight advantage in shooting success during the first half.

Third Quarter

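The third quarter saw a dip in ATL’s performance, particularly on 2pt Dunks, where they made only half of their attempts while NYK made all of theirs (Figure 3). However, ATL had the edge in 2pt Jump Shots (50% vs. 37.5%), and the two teams were level on 2pt Layups at 50%.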

ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q3, color = team)) +
       geom_point(size = 10) + 
       scale_colour_manual(name="", values = c("NYK"="orange", "ATL"="red")) + 
       scale_shape_manual(values=c(23, 24)) + 
       ggtitle("Figure 3: Q3") + 
       ylab("success rate for every shot")

Fourth Quarter and Overtime

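Since the game went into overtime even though NYK had the better first half, we expected ATL to outperform NYK in the fourth quarter. This was indeed the case! Figure 4 shows ATL ahead in every shot category during the fourth quarter. NYK has no points for 2pt Dunks or 2pt Layups: our table keeps only made shots, so a quarter without a make of a given type is NA and is dropped from the plot, which is what the warning about 2 removed rows refers to.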

ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q4, color = team)) + 
       geom_point(size = 10) + 
       scale_colour_manual(name="", values = c("NYK"="orange", "ATL"="red")) + 
       scale_shape_manual(values=c(23, 24)) + 
       ggtitle("Figure 4: Q4") + 
       ylab("success rate for every shot")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

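Overtime presented a more balanced picture (Figure 5). ATL made both of its 2pt Layups, while NYK had the higher success rate on 2pt Jump Shots and 3pt Shots and made both of its 2pt Dunks. Each team is missing from one category (ATL from 2pt Dunks, NYK from 2pt Layups), and with only a handful of attempts per shot type, these overtime rates should be read with caution.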

ggplot(Agg_ATL_NYK_shot_rate_per_q, aes(ShotType, Q5, color = team)) +
       geom_point(size = 10) + 
       scale_colour_manual(name="", values = c("NYK"="orange", "ATL"="red")) + 
       scale_shape_manual(values=c(23, 24)) + 
       ggtitle("Figure 5: Overtime") + 
       ylab("success rate for every shot")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
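The five plots above repeat the same code with a different column each time. As a hedged alternative, the sketch below reshapes a table to long format with pivot_longer() and draws one panel per quarter with facet_wrap(). It uses a small stand-in table (toy_rates is hypothetical, holding only the Q1 and Q2 values for two shot types from the table above) rather than the full dataset.

```r
library(tidyr)
library(ggplot2)

# Stand-in for part of the wide table: two shot types, two quarters
toy_rates <- data.frame(
  ShotType = c("2-pt dunk", "3-pt jump shot", "2-pt dunk", "3-pt jump shot"),
  team     = c("ATL", "ATL", "NYK", "NYK"),
  Q1       = c(100, 21.4, 100, 25),
  Q2       = c(100, 50, 80, 60)
)

# One row per shot type, team and quarter
toy_long <- pivot_longer(toy_rates, cols = starts_with("Q"),
                         names_to = "Quarter", values_to = "rate")

# One panel per quarter instead of one plot per quarter
p <- ggplot(toy_long, aes(ShotType, rate, color = team)) +
       geom_point(size = 4) +
       facet_wrap(~ Quarter) +
       scale_colour_manual(name = "", values = c("NYK" = "orange", "ATL" = "red")) +
       ylab("success rate for every shot")
p
```

Where both teams have the same rate (2pt Dunks in Q1 here), the points overlap and only one color is visible, just as in the separate plots.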

8) Beyond Visualizations: Exploring Additional Insights

While these visualizations provide valuable insights, consider incorporating additional data representations like tables. These can offer a more detailed breakdown of shooting success for each team and shot type across all quarters.

9) Conclusion and Next Steps

By analyzing shooting success by quarter, we can identify strengths and weaknesses for each team throughout the game. This analysis can be further enhanced by incorporating additional factors like defensive strategies and player fatigue.

Kudos

Thank you!

4. Good luck :)

Presenting to the Lab

  1. Context and Background

When presenting these findings to your lab, begin by providing a brief introduction. Explain the purpose of your analysis, which could be something like “I wanted to investigate the shooting performance of ATL and NYK throughout the game.”

  2. Explaining Figures

When presenting figures, take a moment to explain the axes, colors, and shapes used. For example, you could say, “The x-axis represents the shot type, the y-axis indicates shooting success rate, and the colors orange and red represent NYK and ATL, respectively.”

  3. Interpretation and Conclusions

Think about what your figures and tables show, and tell people what you think it means. This might be the most important step before you present to the lab members. If you don’t know, it’s okay to say “I don’t know”, but try to think it through first.